Abstract

High dimensional regression benefits from sparsity promoting regularizations. Screening rules leverage
the known sparsity of the solution by ignoring some variables in the optimization, hence speeding up
solvers. When the procedure is proven not to discard features wrongly, the rules are said to be safe. In
this paper we derive new safe rules for generalized linear models regularized with $\ell_1$ and $\ell_1/\ell_2$ norms.
The rules are based on duality gap computations and spherical safe regions whose diameters converge to
zero. This makes it possible to safely discard more variables, in particular for low regularization parameters. The
GAP Safe rule can cope with any iterative solver and we illustrate its performance on coordinate descent
for multi-task Lasso, binary and multinomial logistic regression, demonstrating significant speed ups on
all tested datasets with respect to previous safe rules.


1 Introduction 

The computational burden of solving high dimensional regularized regression problems has led to a vast
literature in the last couple of decades on accelerating the algorithmic solvers. With the increasing popularity
of $\ell_1$-type regularization ranging from the Lasso [18] or group-Lasso [24] to regularized logistic regression and
multi-task learning, many algorithmic methods have emerged to solve the associated optimization problems.
Although for the simple $\ell_1$ regularized least squares a specific algorithm (e.g., the LARS [8]) can be considered,
for more general formulations, penalties, and possibly larger dimensions, coordinate descent has proved to be
a surprisingly efficient strategy [12].

Our main objective in this work is to propose a technique that can speed up any solver for such learning
problems, and that is particularly well suited for coordinate descent methods, thanks to active set strategies.

The safe rules introduced by [9] for generalized $\ell_1$ regularized problems are a set of rules that make it possible to
eliminate features whose associated coefficients are proved to be zero at the optimum. Relaxing the safe rule,
one can obtain some more speed-up at the price of possible mistakes. Such heuristic strategies, called strong
rules [19], reduce the computational cost using an active set strategy, but require difficult post-processing
to check for features possibly wrongly discarded. Another road to speed up screening methods has been
the introduction of sequential safe rules [21, 23, 22]. The idea is to improve the screening thanks to the
computations done for a previous regularization parameter. This scenario is particularly relevant in machine
learning, where one computes solutions over a grid of regularization parameters, so as to select the best one
(e.g., to perform cross-validation). Nevertheless, such strategies suffer from the same problem as strong rules,
since relevant features can be wrongly disregarded: sequential rules usually rely on theoretical quantities
that are not known by the solver, but only approximated. In particular, for such rules to work one needs the
exact dual optimal solution from the previous regularization parameter.

Recently, the introduction of safe dynamic rules [6, 5] has opened a promising avenue by letting the
screening be done not only at the beginning of the algorithm, but all along the iterations. Following a
method introduced for the Lasso [11], we generalize this dynamical safe rule, called GAP Safe rules (because
it relies on duality gap computation), to a large class of learning problems with the following benefits:

• a unified and flexible framework for a wider family of problems, 

• easy to insert in existing solvers, 

• proved to be safe, 

• more efficient than previous safe rules,

• achieves fast true active set identification. 

We introduce our general GAP Safe framework in Section 2. We then specialize it to important machine
learning use cases in Section 3. In Section 4 we apply our GAP Safe rules to a multi-task Lasso problem,
relevant for brain imaging with magnetoencephalography data, as well as to multinomial logistic regression
regularized with the $\ell_1/\ell_2$ norm for joint feature selection.

2 GAP Safe rules 

2.1 Model and notations 

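We denote by $[d]$ the set $\{1, \ldots, d\}$ for any integer $d \in \mathbb{N}$, and by $Q^\top$ the transpose of a matrix $Q$. Our
observation matrix is $Y \in \mathbb{R}^{n \times q}$ where $n$ represents the number of samples, and $q$ the number of tasks or
classes. The design matrix $X = [x^{(1)}, \ldots, x^{(p)}] = [x_1, \ldots, x_n]^\top \in \mathbb{R}^{n \times p}$ has $p$ explanatory variables (or
features) column-wise, and $n$ observations row-wise. The standard $\ell_2$ norm is written $\|\cdot\|_2$, the $\ell_1$ norm $\|\cdot\|_1$,
the $\ell_\infty$ norm $\|\cdot\|_\infty$. The $\ell_2$ unit ball is denoted by $\mathcal{B}_2$ (or simply $\mathcal{B}$) and we write $\mathcal{B}(c, r)$ for the $\ell_2$ ball with
center $c$ and radius $r$. For a matrix $B \in \mathbb{R}^{p \times q}$, we denote by $\|B\|_2^2 = \sum_{j=1}^{p} \sum_{k=1}^{q} B_{j,k}^2$ the (squared) Frobenius norm,
and by $\langle \cdot, \cdot \rangle$ the associated inner product.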

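We consider the general optimization problem of minimizing a separable function with a group-Lasso
regularization. The parameter to recover is a matrix $B \in \mathbb{R}^{p \times q}$, and for any $j$ in $[p]$, $B_{j,:}$ is the $j$-th row of
$B$, while for any $k$ in $[q]$, $B_{:,k}$ is the $k$-th column. We would like to find

$$\hat{B}^{(\lambda)} \in \arg\min_{B \in \mathbb{R}^{p \times q}} \underbrace{\sum_{i=1}^{n} f_i(x_i^\top B) + \lambda \Omega(B)}_{P_\lambda(B)}, \tag{1}$$

where $f_i : \mathbb{R}^{1 \times q} \to \mathbb{R}$ is a convex function with $1/\gamma$-Lipschitz gradient. So $F : B \mapsto \sum_{i=1}^{n} f_i(x_i^\top B)$ is also
convex with Lipschitz gradient. The function $\Omega : \mathbb{R}^{p \times q} \to \mathbb{R}_+$ is the $\ell_1/\ell_2$ norm $\Omega(B) = \sum_{j=1}^{p} \|B_{j,:}\|_2$,
promoting a few lines of $B$ to be non-zero at a time. The $\lambda$ parameter is a non-negative constant controlling
the trade-off between data fitting and regularization.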

Some elements of convex analysis used in the following are introduced here. For a convex function
$f : \mathbb{R}^d \to [-\infty, +\infty]$ the Fenchel-Legendre transform¹ of $f$ is the function $f^* : \mathbb{R}^d \to [-\infty, +\infty]$ defined by
$f^*(u) = \sup_{z \in \mathbb{R}^d} \langle z, u \rangle - f(z)$. The sub-differential of a function $f$ at a point $x$ is denoted by $\partial f(x)$. The
dual norm of $\Omega$ is the $\ell_\infty/\ell_2$ norm and reads $\Omega^*(B) = \max_{j \in [p]} \|B_{j,:}\|_2$.
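As a small illustration of the two norms just defined, the following NumPy sketch computes $\Omega$ and $\Omega^*$ row-wise and checks the duality inequality $\langle A, B \rangle \le \Omega(A)\,\Omega^*(B)$ on random matrices. The function names `omega` and `omega_dual` are hypothetical and chosen only for this example.

```python
import numpy as np

def omega(B):
    """l1/l2 norm: sum of the l2 norms of the rows of B."""
    return np.sum(np.linalg.norm(B, axis=1))

def omega_dual(B):
    """l_inf/l2 dual norm: largest l2 norm among the rows of B."""
    return np.max(np.linalg.norm(B, axis=1))

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
B = rng.standard_normal((5, 3))

# Frobenius inner product, bounded by Omega(A) * Omega^*(B) by duality.
inner = np.sum(A * B)
bound = omega(A) * omega_dual(B)
```

For $q = 1$ the two functions reduce to the usual $\ell_1$ and $\ell_\infty$ norms of a vector stored as a column.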

Remark 1. For ease of reading, all groups are weighted with equal strength, but extension of our results
to non-equal weights as proposed in the original group-Lasso paper [24] would be straightforward.

2.2 Basic properties 

First we recall the associated Fermat’s condition and a dual formulation of the optimization problem: 

¹ This is also often referred to as the (convex) conjugate of a function.
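Theorem 1. Fermat's condition (see [3, Proposition 26.1] for a more general result).
For any convex function $f : \mathbb{R}^d \to \mathbb{R}$:

$$x \in \arg\min_{x \in \mathbb{R}^d} f(x) \iff 0 \in \partial f(x). \tag{2}$$

Theorem 2 ([9]). A dual formulation of (1) is given by

$$\hat{\Theta}^{(\lambda)} = \arg\max_{\Theta \in \Delta_X} \underbrace{-\sum_{i=1}^{n} f_i^*(-\lambda \Theta_{i,:})}_{D_\lambda(\Theta)}, \tag{3}$$

where $\Delta_X = \{\Theta \in \mathbb{R}^{n \times q} : \forall j \in [p], \|x^{(j)\top} \Theta\|_2 \le 1\} = \{\Theta \in \mathbb{R}^{n \times q} : \Omega^*(X^\top \Theta) \le 1\}$. The primal and dual
solutions are linked by

$$\forall i \in [n], \quad \hat{\Theta}^{(\lambda)}_{i,:} = -\nabla f_i(x_i^\top \hat{B}^{(\lambda)}) / \lambda. \tag{4}$$

Furthermore, Fermat's condition reads:

$$\forall j \in [p], \quad x^{(j)\top} \hat{\Theta}^{(\lambda)} \in
\begin{cases}
\left\{ \dfrac{\hat{B}^{(\lambda)}_{j,:}}{\|\hat{B}^{(\lambda)}_{j,:}\|_2} \right\}, & \text{if } \hat{B}^{(\lambda)}_{j,:} \neq 0, \\
\mathcal{B}_2, & \text{if } \hat{B}^{(\lambda)}_{j,:} = 0.
\end{cases} \tag{5}$$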

Remark 2. Contrary to the primal, the dual problem has a unique solution under our assumption on $f_i$.
Indeed, the dual function is strongly concave, hence strictly concave.


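Remark 3. For any $\theta \in \mathbb{R}^{n \times q}$ let us introduce $G(\theta) = [\nabla f_1(\theta_{1,:})^\top, \ldots, \nabla f_n(\theta_{n,:})^\top]^\top \in \mathbb{R}^{n \times q}$. Then the
primal/dual link can be written $\hat{\Theta}^{(\lambda)} = -G(X \hat{B}^{(\lambda)}) / \lambda$.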


2.3 Critical parameter: $\lambda_{\max}$

For $\lambda$ large enough the solution of the primal problem is simply 0. Thanks to Fermat's rule (2), 0 is
optimal if and only if $-\nabla F(0)/\lambda \in \partial \Omega(0)$. Thanks to the property of the dual norm this is equivalent to
$\Omega^*(\nabla F(0)/\lambda) \le 1$ where $\Omega^*$ is the dual norm of $\Omega$. Since $\nabla F(0) = X^\top G(0)$, 0 is a primal solution of $P_\lambda$ if
and only if $\lambda \ge \lambda_{\max} := \max_{j \in [p]} \|x^{(j)\top} G(0)\|_2 = \Omega^*(X^\top G(0))$.

This development shows that for $\lambda \ge \lambda_{\max}$, Problem (1) is trivial. So from now on, we will only focus
on the case where $\lambda \le \lambda_{\max}$.
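The critical parameter can be computed directly from its definition. The sketch below is a minimal NumPy illustration, assuming the squared loss of the multi-task Lasso, for which $G(0) = -Y$; the helper name `lambda_max` is hypothetical.

```python
import numpy as np

def lambda_max(X, G0):
    """Omega^*(X^T G(0)): largest l2 norm among the rows of X^T G(0)."""
    G0 = np.asarray(G0, dtype=float).reshape(X.shape[0], -1)  # vector -> n x 1 matrix
    return np.max(np.linalg.norm(X.T @ G0, axis=1))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))
Y = rng.standard_normal((20, 3))

# Multi-task Lasso (squared loss): G(theta) = theta - Y, hence G(0) = -Y.
lam_max_multitask = lambda_max(X, -Y)

# Lasso (q = 1): the same formula reduces to the l_inf norm of X^T y.
lam_max_lasso = lambda_max(X, -Y[:, 0])
```

For any $\lambda$ at or above the returned value, the zero matrix solves the primal problem, so a regularization path only needs to be computed below it.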


2.4 Screening rules description 

Safe screening rules rely on a simple consequence of Fermat's condition:

$$\|x^{(j)\top} \hat{\Theta}^{(\lambda)}\|_2 < 1 \implies \hat{B}^{(\lambda)}_{j,:} = 0. \tag{6}$$

Stated in such a way, this relation is useless because $\hat{\Theta}^{(\lambda)}$ is unknown (unless $\lambda > \lambda_{\max}$). However, it is often
possible to construct a set $\mathcal{R} \subset \mathbb{R}^{n \times q}$, called a safe region, containing it. Then, note that

$$\max_{\Theta \in \mathcal{R}} \|x^{(j)\top} \Theta\|_2 < 1 \implies \hat{B}^{(\lambda)}_{j,:} = 0. \tag{7}$$

The so-called safe screening rules consist in removing the variable $j$ from the problem whenever the previous
test is satisfied, since $\hat{B}^{(\lambda)}_{j,:}$ is then guaranteed to be zero. This property leads to considerable speed-up in
practice, especially with active set strategies, see for instance [11] for the Lasso case. A natural goal is to
find safe regions as narrow as possible: smaller safe regions can only increase the number of screened out
variables. However, complex regions could lead to a computational burden limiting the benefit of screening.
Hence, we focus on constructing $\mathcal{R}$ satisfying the trade-off:

• $\mathcal{R}$ is as small as possible and contains $\hat{\Theta}^{(\lambda)}$.

• Computing $\max_{\Theta \in \mathcal{R}} \|x^{(j)\top} \Theta\|_2$ is cheap.
2.5 Spheres as safe regions

Various shapes have been considered in practice for the set $\mathcal{R}$ such as balls (referred to as spheres) [9], domes
[11] or more refined sets (see [23] for a survey). Here we consider the so-called "sphere regions", choosing
a ball $\mathcal{R} = \mathcal{B}(c, r)$ as a safe region. One can easily obtain a control on $\max_{\Theta \in \mathcal{B}(c, r)} \|x^{(j)\top} \Theta\|_2$ by extending
the computation of the support function of a ball [11, Eq. (9)] to the matrix case:

$$\max_{\Theta \in \mathcal{B}(c, r)} \|x^{(j)\top} \Theta\|_2 \le \|x^{(j)\top} c\|_2 + r \|x^{(j)}\|_2.$$

Note that here the center $c$ is a matrix in $\mathbb{R}^{n \times q}$. We can now state the safe sphere test:

Sphere test: If $\|x^{(j)\top} c\|_2 + r \|x^{(j)}\|_2 < 1$, then $\hat{B}^{(\lambda)}_{j,:} = 0$. (8)
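The sphere test (8) amounts to one matrix product and one column-norm computation. The following NumPy sketch evaluates it for all features at once; the function name `sphere_test` and the toy numbers are assumptions made for illustration only.

```python
import numpy as np

def sphere_test(X, center, radius):
    """Return a boolean mask that is True for features passing the sphere test (8)."""
    center = np.asarray(center, dtype=float).reshape(X.shape[0], -1)  # n x q center
    scores = np.linalg.norm(X.T @ center, axis=1) + radius * np.linalg.norm(X, axis=0)
    return scores < 1.0

# Toy design with orthonormal columns, so that x^(j)^T c is simply c_j.
X = np.eye(3)
center = np.array([0.5, 0.9, 0.2])
mask = sphere_test(X, center, radius=0.2)  # scores are 0.7, 1.1 and 0.4
```

Features flagged `True` can be dropped from the problem; shrinking `radius` can only enlarge the set of flagged features.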

2.6 GAP Safe rule description 

In this section we derive a GAP Safe screening rule extending the one introduced in [11]. For this, we rely
on the strong convexity of the dual objective function and on weak duality.


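Finding a radius: Remember that $\forall i \in [n]$, $f_i$ is differentiable with a $1/\gamma$-Lipschitz gradient. As a
consequence, $\forall i \in [n]$, $f_i^*$ is $\gamma$-strongly convex [14, Theorem 4.2.2, p. 83] and so $D_\lambda$ is $\gamma \lambda^2$-strongly concave:

$$\forall (\Theta_1, \Theta_2) \in \mathbb{R}^{n \times q} \times \mathbb{R}^{n \times q}, \quad D_\lambda(\Theta_2) \le D_\lambda(\Theta_1) + \langle \nabla D_\lambda(\Theta_1), \Theta_2 - \Theta_1 \rangle - \frac{\gamma \lambda^2}{2} \|\Theta_1 - \Theta_2\|_2^2.$$

Specifying the previous inequality for $\Theta_1 = \hat{\Theta}^{(\lambda)}$, $\Theta_2 = \Theta \in \Delta_X$, one has

$$D_\lambda(\Theta) \le D_\lambda(\hat{\Theta}^{(\lambda)}) + \langle \nabla D_\lambda(\hat{\Theta}^{(\lambda)}), \Theta - \hat{\Theta}^{(\lambda)} \rangle - \frac{\gamma \lambda^2}{2} \|\hat{\Theta}^{(\lambda)} - \Theta\|_2^2.$$

By definition, $\hat{\Theta}^{(\lambda)}$ maximizes $D_\lambda$ on $\Delta_X$, so we have: $\langle \nabla D_\lambda(\hat{\Theta}^{(\lambda)}), \Theta - \hat{\Theta}^{(\lambda)} \rangle \le 0$. This implies

$$D_\lambda(\Theta) \le D_\lambda(\hat{\Theta}^{(\lambda)}) - \frac{\gamma \lambda^2}{2} \|\hat{\Theta}^{(\lambda)} - \Theta\|_2^2.$$

By weak duality, $\forall B \in \mathbb{R}^{p \times q}$, $D_\lambda(\hat{\Theta}^{(\lambda)}) \le P_\lambda(B)$, so: $\forall B \in \mathbb{R}^{p \times q}, \forall \Theta \in \Delta_X$, $D_\lambda(\Theta) \le P_\lambda(B) - \frac{\gamma \lambda^2}{2} \|\hat{\Theta}^{(\lambda)} - \Theta\|_2^2$,
and we deduce the following theorem:

Theorem 3.

$$\forall B \in \mathbb{R}^{p \times q}, \forall \Theta \in \Delta_X, \quad \|\hat{\Theta}^{(\lambda)} - \Theta\|_2 \le \sqrt{\frac{2 (P_\lambda(B) - D_\lambda(\Theta))}{\gamma \lambda^2}} =: \hat{r}_\lambda(B, \Theta). \tag{9}$$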
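To give a feel for the scale of the radius in (9), here is a short worked instance with hypothetical values (they are not taken from the experiments): a squared loss, so $\gamma = 1$, a regularization level $\lambda = 0.1$ and a duality gap of $10^{-4}$.

```latex
\hat{r}_\lambda(B, \Theta)
  = \sqrt{\frac{2 \times 10^{-4}}{1 \times (0.1)^2}}
  = \sqrt{2 \times 10^{-2}}
  \approx 0.141 .
```

Since the radius scales as the square root of the gap, dividing the duality gap by 100 divides the radius by 10.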


Provided one knows a dual feasible point $\Theta \in \Delta_X$ and a $B \in \mathbb{R}^{p \times q}$, it is possible to construct a safe
sphere with radius $\hat{r}_\lambda(B, \Theta)$ centered on $\Theta$. We now only need to build a (relevant) dual point to center such
a ball. Results from Section 2.3 ensure that $-G(0)/\lambda_{\max} \in \Delta_X$, but it leads to a static rule, as introduced
in [9]. We need a dynamic center to improve the screening as the solver proceeds.


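Finding a center: Remember that $\hat{\Theta}^{(\lambda)} = -G(X \hat{B}^{(\lambda)}) / \lambda$. Now assume that one has a converging algorithm
for the primal problem, i.e., $B_k \to \hat{B}^{(\lambda)}$. Hence, a natural choice for creating a dual feasible point $\Theta_k$
is to choose it proportional to $-G(X B_k)$, for instance by setting:

$$\Theta_k =
\begin{cases}
\dfrac{R_k}{\lambda}, & \text{if } \Omega^*(X^\top R_k) \le \lambda, \\
\dfrac{R_k}{\Omega^*(X^\top R_k)}, & \text{otherwise,}
\end{cases}
\qquad \text{where } R_k = -G(X B_k). \tag{10}$$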


A refined method consists in solving the one dimensional problem: $\arg\max_{\Theta \in \Delta_X \cap \mathrm{Span}(R_k)} D_\lambda(\Theta)$. In the
Lasso and group-Lasso case [5, 6, 11] such a step is simply a projection on the intersection of a line and the
(polytope) dual set and can be computed efficiently. However for logistic regression the computation is more
involved, so we have opted for the simpler solution in Equation (10). This still provides converging safe rules
(see Proposition 1).
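The two ingredients of the safe sphere, the center from (10) and the radius from (9), can be sketched as follows for the multi-task Lasso. This is only an illustration under the squared-loss assumption ($G(\theta) = \theta - Y$, $\gamma = 1$, and $D_\lambda(\Theta) = \frac{1}{2}\|Y\|_2^2 - \frac{1}{2}\|Y - \lambda\Theta\|_2^2$); the function name `dual_point_and_radius` is hypothetical.

```python
import numpy as np

def dual_point_and_radius(X, Y, B, lam, gamma=1.0):
    """Dual feasible point of Eq. (10) and gap-based radius of Eq. (9), squared loss."""
    R = Y - X @ B                                         # R_k = -G(X B_k)
    dual_norm = np.max(np.linalg.norm(X.T @ R, axis=1))   # Omega^*(X^T R_k)
    Theta = R / max(lam, dual_norm)                       # both cases of Eq. (10)
    primal = 0.5 * np.sum(R ** 2) + lam * np.sum(np.linalg.norm(B, axis=1))
    dual = 0.5 * np.sum(Y ** 2) - 0.5 * np.sum((Y - lam * Theta) ** 2)
    gap = max(primal - dual, 0.0)                         # guard against rounding
    radius = np.sqrt(2.0 * gap / (gamma * lam ** 2))
    return Theta, radius

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 40))
Y = rng.standard_normal((20, 3))
B0 = np.zeros((40, 3))
lam_max = np.max(np.linalg.norm(X.T @ Y, axis=1))
Theta, radius = dual_point_and_radius(X, Y, B0, 0.5 * lam_max)
```

Dividing by `max(lam, dual_norm)` covers both branches of (10) in one expression, which keeps the returned point inside $\Delta_X$ by construction.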
Dynamic GAP Safe rule summarized

We can now state our dynamical GAP Safe rule at the k-th step of an iterative solver:

1. Compute $B_k$, and then obtain $\Theta_k$ and $\hat{r}_\lambda(B_k, \Theta_k)$ using (10).

2. If $\|x^{(j)\top} \Theta_k\|_2 + \hat{r}_\lambda(B_k, \Theta_k) \|x^{(j)}\|_2 < 1$, then set $\hat{B}^{(\lambda)}_{j,:} = 0$ and remove $x^{(j)}$ from $X$.

Dynamic safe screening rules are more efficient than existing methods in practice because they can 
increase the ability of screening as the algorithm proceeds. Since one has sharper and sharper dual regions 
available along the iterations, support identification is improved. Provided one relies on a primal converging 
algorithm, one can show that the dual sequence we propose is converging too. 

The convergence of the primal is unaltered by our GAP Safe rule: screening out unnecessary coefficients 
of Bk can only decrease its distance with its original limits. Moreover, a practical consequence is that one 
can observe surprising situations where lowering the tolerance of the solver can reduce the computation time. 
This can happen for sequential setups. 
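To make the dynamic rule concrete, here is a minimal sketch of cyclic coordinate descent for the Lasso ($q = 1$, $\gamma = 1$) with a GAP Safe screening step every few epochs. It is an illustrative toy and not the implementation used in the experiments: the function name, its parameters and the choice of recomputing the dual point from all columns of $X$ are assumptions made for simplicity.

```python
import numpy as np

def lasso_cd_gap_safe(X, y, lam, tol=1e-6, max_epochs=10000,
                      screen_every=10, screen=True):
    """Cyclic coordinate descent for the Lasso with optional GAP Safe screening."""
    n, p = X.shape
    beta = np.zeros(p)
    residual = y.astype(float).copy()        # always equals y - X @ beta
    col_norms = np.linalg.norm(X, axis=0)
    active = col_norms > 0                   # features not screened out so far
    gap = np.inf
    for epoch in range(max_epochs):
        if epoch % screen_every == 0:
            corr = X.T @ residual
            scale = max(lam, np.max(np.abs(corr)))
            theta = residual / scale         # dual feasible point, Eq. (10)
            primal = 0.5 * residual @ residual + lam * np.abs(beta).sum()
            dual = 0.5 * y @ y - 0.5 * np.sum((y - lam * theta) ** 2)
            gap = max(primal - dual, 0.0)
            if gap <= tol:
                break
            if screen:
                radius = np.sqrt(2.0 * gap) / lam            # Eq. (9), gamma = 1
                scores = np.abs(corr) / scale + radius * col_norms
                screened = active & (scores < 1.0)           # sphere test (8)
                for j in np.flatnonzero(screened):
                    residual += X[:, j] * beta[j]            # reset beta_j to zero
                    beta[j] = 0.0
                active &= ~screened
        for j in np.flatnonzero(active):     # update only the surviving features
            old = beta[j]
            rho = X[:, j] @ residual + col_norms[j] ** 2 * old
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_norms[j] ** 2
            if beta[j] != old:
                residual -= X[:, j] * (beta[j] - old)
    return beta, active, gap
```

Calling the function with `screen=False` gives the plain solver, which makes it easy to check on a small problem that both variants reach the same predictions while the screened one updates fewer coordinates.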

Proposition 1. Let $B_k$ be the current estimate of $\hat{B}^{(\lambda)}$ and $\Theta_k$ defined in Eq. (10) be the current estimate
of $\hat{\Theta}^{(\lambda)}$. Then $\lim_{k \to +\infty} B_k = \hat{B}^{(\lambda)}$ implies $\lim_{k \to +\infty} \Theta_k = \hat{\Theta}^{(\lambda)}$.

Note that if the primal sequence is converging to the optimal, our dual sequence is also converging. But
we know that the radius of our safe sphere is $(2 (P_\lambda(B_k) - D_\lambda(\Theta_k)) / (\gamma \lambda^2))^{1/2}$. By strong duality, this radius
converges to 0, hence we have certified that our GAP Safe regions sequence $\mathcal{B}(\Theta_k, \hat{r}_\lambda(B_k, \Theta_k))$ is a converging
safe rule (in the sense introduced in [11, Definition 1]).

Remark 4. The active set obtained by our GAP Safe rule (i.e., the indexes of non screened-out variables)
converges to the equicorrelation set [20] $\mathcal{E}_\lambda := \{j \in [p] : \|x^{(j)\top} \hat{\Theta}^{(\lambda)}\|_2 = 1\}$, allowing us to identify
relevant features early (see Proposition 2 in the supplementary material for more details).

3 Special cases of interest 

We now specialize our results to relevant supervised learning problems, see also Table 1. 

3.1 Lasso 

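In the Lasso case $q = 1$, the parameter is a vector: $B = \beta \in \mathbb{R}^p$, $F(\beta) = \frac{1}{2} \|y - X\beta\|_2^2 = \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 / 2$,
meaning that $f_i(z) = (y_i - z)^2 / 2$ and $\Omega(\beta) = \|\beta\|_1$.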

3.2 multi-task regression 

In the multi-task Lasso, which is a special case of group-Lasso, we assume that the observation is $Y \in \mathbb{R}^{n \times q}$,
$F(B) = \frac{1}{2} \|Y - XB\|_2^2 = \frac{1}{2} \sum_{i=1}^{n} \|Y_{i,:} - x_i^\top B\|_2^2$ (i.e., $f_i(z) = \|Y_{i,:} - z\|_2^2 / 2$) and $\Omega(B) = \sum_{j=1}^{p} \|B_{j,:}\|_2$. In signal
processing, this model is also referred to as the Multiple Measurement Vector (MMV) problem. It makes it possible to
jointly select the same features for multiple regression tasks [1, 2].

Remark 5. Our framework could easily encompass the case of non-overlapping groups with various sizes
and weights presented in [6]. Since our aim is mostly multi-task and multinomial applications, we have
instead presented a matrix formulation.
3.3 $\ell_1$ regularized logistic regression

Here, we consider the formulation given in [7, Chapter 3] for the two-class logistic regression. In such a
context, one observes for each $i \in [n]$ a class label $c_i \in \{1, 2\}$. This information can be recast as $y_i = \mathbb{1}_{\{c_i = 1\}}$,
and it is then customary to minimize (1) where

$$F(\beta) = \sum_{i=1}^{n} \left( -y_i x_i^\top \beta + \log\left(1 + \exp(x_i^\top \beta)\right) \right), \tag{11}$$

with $B = \beta \in \mathbb{R}^p$ (i.e., $q = 1$), $f_i(z) = -y_i z + \log(1 + \exp(z))$ and the penalty is simply the $\ell_1$ norm:
$\Omega(\beta) = \|\beta\|_1$. Let us introduce Nh, the (binary) negative entropy function² defined by

$$\mathrm{Nh}(x) =
\begin{cases}
x \log(x) + (1 - x) \log(1 - x), & \text{if } x \in [0, 1], \\
+\infty, & \text{otherwise.}
\end{cases} \tag{12}$$

Then, one can easily check that $f_i^*(z_i) = \mathrm{Nh}(z_i + y_i)$ and $\gamma = 4$.
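The two facts stated for the logistic loss can be verified in a few lines. The following derivation is a sketch written for this purpose, using only the definition of $f_i$ and of the Fenchel-Legendre transform; the shorthand $s = u + y_i$ and $\sigma$ for the sigmoid are introduced here for convenience.

```latex
\begin{aligned}
f_i^*(u) &= \sup_{z \in \mathbb{R}} \; (u + y_i) z - \log(1 + e^{z}),
  \qquad s := u + y_i, \quad \sigma(z) := \frac{e^{z}}{1 + e^{z}} . \\
\text{For } s \in (0, 1):\quad & s = \sigma(z)
  \;\Longleftrightarrow\; z = \log\frac{s}{1 - s},
  \qquad \log(1 + e^{z}) = -\log(1 - s), \\
f_i^*(u) &= s \log\frac{s}{1 - s} + \log(1 - s)
  = s \log s + (1 - s) \log(1 - s) = \mathrm{Nh}(u + y_i) . \\
f_i''(z) &= \sigma(z) \, (1 - \sigma(z)) \le \tfrac{1}{4}
  \;\Longrightarrow\; \nabla f_i \text{ is } \tfrac{1}{4}\text{-Lipschitz, i.e. } \gamma = 4 .
\end{aligned}
```

For $s \notin [0, 1]$ the supremum is $+\infty$, which matches the second branch of (12); the endpoints $s \in \{0, 1\}$ give 0, consistent with the convention $0 \log(0) = 0$.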


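3.4 $\ell_1/\ell_2$ multinomial logistic regression

We adapt the formulation given in [7, Chapter 3] for the multinomial regression. In such a context, one
observes for each $i \in [n]$ a class label $c_i \in \{1, \ldots, q\}$. This information can be recast into a matrix $Y \in \mathbb{R}^{n \times q}$
filled by 0's and 1's: $Y_{i,k} = \mathbb{1}_{\{c_i = k\}}$. In the same spirit as the multi-task Lasso, a matrix $B \in \mathbb{R}^{p \times q}$ is
formed by $q$ vectors encoding the hyperplanes for the linear classification. The multinomial $\ell_1/\ell_2$ regularized
regression reads:

$$F(B) = \sum_{i=1}^{n} \left( \sum_{k=1}^{q} -Y_{i,k} \, x_i^\top B_{:,k} + \log\left( \sum_{k=1}^{q} \exp\left(x_i^\top B_{:,k}\right) \right) \right), \tag{13}$$

with $f_i(z) = \sum_{k=1}^{q} -Y_{i,k} z_k + \log\left(\sum_{k=1}^{q} \exp(z_k)\right)$ to recover the formulation as in (1). Let us introduce NH,
the negative entropy function defined by (still with the convention $0 \log(0) = 0$)

$$\mathrm{NH}(x) =
\begin{cases}
\sum_{i=1}^{q} x_i \log(x_i), & \text{if } x \in \Sigma_q = \{x \in \mathbb{R}_+^q : \sum_{i=1}^{q} x_i = 1\}, \\
+\infty, & \text{otherwise.}
\end{cases} \tag{14}$$

Again, one can easily check that $f_i^*(z) = \mathrm{NH}(z + Y_{i,:})$ and $\gamma = 1$.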

Remark 6. For multinomial logistic regression, $D_\lambda$ implicitly encodes the additional constraint $\Theta \in \mathrm{dom}\, D_\lambda =
\{\Theta' : \forall i \in [n], -\lambda \Theta'_{i,:} + Y_{i,:} \in \Sigma_q\}$ where $\Sigma_q$ is the $q$ dimensional simplex, see (14). As 0 and $R_k/\lambda$ both
belong to this set, any convex combination of them, such as $\Theta_k$ defined in (10), satisfies this additional
constraint.


Remark 7. The intercept has been neglected in our models for simplicity. Our GAP Safe framework can 
also handle such a feature at the cost of more technical details (by adapting the results from [15] for instance). 
However, in practice, the intercept can be handled in the present formulation by adding a constant column 
to the design matrix X. The intercept is then regularized. However, if the constant is set high enough, 
regularization is small and experiments show that it has little to no impact for high-dimensional problems. 
This is the strategy used by the Liblinear package [10]. 


² With the convention $0 \log(0) = 0$.
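                    | Lasso | Multi-task regr. | Logistic regr. | Multinomial regr.
$f_i(z)$            | $(y_i - z)^2 / 2$ | $\|Y_{i,:} - z\|_2^2 / 2$ | $\log(1 + e^{z}) - y_i z$ | $\log\left(\sum_{k=1}^{q} e^{z_k}\right) - \sum_{k=1}^{q} Y_{i,k} z_k$
$f_i^*(u)$          | $((y_i + u)^2 - y_i^2) / 2$ | $(\|Y_{i,:} + u\|_2^2 - \|Y_{i,:}\|_2^2) / 2$ | $\mathrm{Nh}(u + y_i)$ | $\mathrm{NH}(u + Y_{i,:})$
$\Omega(B)$         | $\|\beta\|_1$ | $\sum_{j=1}^{p} \|B_{j,:}\|_2$ | $\|\beta\|_1$ | $\sum_{j=1}^{p} \|B_{j,:}\|_2$
$\lambda_{\max}$    | $\|X^\top y\|_\infty$ | $\Omega^*(X^\top Y)$ | $\|X^\top (\mathbf{1}_n / 2 - y)\|_\infty$ | $\Omega^*(X^\top (\mathbf{1}_{n \times q} / q - Y))$
$G(\theta)$         | $\theta - y$ | $\Theta - Y$ | $\frac{e^{\theta}}{1 + e^{\theta}} - y$ | $\mathrm{RowNorm}(e^{\Theta}) - Y$
$\gamma$            | 1 | 1 | 4 | 1

Table 1: Useful ingredients for computing GAP Safe rules. We have used lower case to indicate when the
parameters are vectorial (i.e., $q = 1$). The function RowNorm consists in normalizing a (non-negative)
matrix row-wise, such that each row sums to one.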
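The $G(\theta)$ row of Table 1 translates directly into code. The sketch below is a NumPy illustration with hypothetical function names; the max-shift inside the multinomial case is a standard numerical safeguard added here and does not change the result, because RowNorm is invariant to subtracting a constant from each row before exponentiating.

```python
import numpy as np

def row_norm(M):
    """Normalize a non-negative matrix so that each row sums to one."""
    return M / M.sum(axis=1, keepdims=True)

def G_squared_loss(Theta, Y):
    """Lasso and multi-task Lasso: G(theta) = theta - Y."""
    return Theta - Y

def G_logistic(theta, y):
    """Binary logistic regression: sigmoid(theta) - y."""
    return 1.0 / (1.0 + np.exp(-theta)) - y

def G_multinomial(Theta, Y):
    """Multinomial logistic regression: RowNorm(exp(Theta)) - Y."""
    shifted = Theta - Theta.max(axis=1, keepdims=True)  # avoids overflow in exp
    return row_norm(np.exp(shifted)) - Y

y = np.array([0.0, 1.0, 1.0])
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
g_logistic_at_zero = G_logistic(np.zeros(3), y)            # equals 1/2 - y
g_multinomial_at_zero = G_multinomial(np.zeros((2, 3)), Y)  # equals 1/q - Y
```

Evaluating these functions at zero recovers the quantities appearing in the $\lambda_{\max}$ row of the table, since $\lambda_{\max} = \Omega^*(X^\top G(0))$.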


4 Experiments 

In this section we present results obtained with the GAP Safe rule. Results are on high dimensional data,
both dense and sparse. Implementations were done in Python, with Cython for the low-level critical parts. They are
based on the multi-task Lasso implementation of Scikit-Learn [17] and the coordinate descent logistic regression
solver in the Lightning software [4]. In all experiments, the coordinate descent algorithm used follows the
pseudo code from [11] with a screening step every 10 iterations.

Note that we have not performed comparisons with the sequential screening rules commonly acknowledged
as the state-of-the-art "safe" screening rules (such as the EDPP+ [21]), since we can show that this kind of
rule is not safe. Indeed, the stopping criterion is based on dual gap accuracy, and comparisons would be
unfair since such methods sometimes do not converge to the prescribed accuracy. This is backed up by a
counter example given in the supplementary material. Nevertheless, modifications of such rules, inspired by
our GAP Safe rules, can make them safe. However the obtained sequential rules are still outperformed by
our dynamic strategies (see Figure 2 for an illustration).

4.1 $\ell_1/\ell_2$ multi-task regression

To demonstrate the benefit of the GAP Safe screening rule for a multi-task Lasso problem we used neuroimaging
data. Electroencephalography (EEG) and magnetoencephalography (MEG) are brain imaging
modalities that make it possible to identify active brain regions. The problem to solve is a multi-task regression problem
with squared loss where every task corresponds to a time instant. Using a multi-task Lasso one can
constrain the recovered sources to be identical during a short time interval [13]. This corresponds to a
temporal stationarity assumption. In this experiment we used joint MEG/EEG data with 301 MEG and
59 EEG sensors leading to n = 360. The number of possible sources is p = 22,494 and the number of time
instants q = 20. With a 1 kHz sampling rate it is equivalent to say that the sources stay the same for 20 ms.

Results are presented in Figure 1. The GAP Safe rule is compared with the dynamic safe rule from [6].
The experimental setup consists in estimating the solutions of the multi-task Lasso problem for 100 values of
A on a logarithmic grid from A^ax to Amax/10^. For the experiments on the left a fixed number of iterations 
from 2 to 2 ^^ is allowed for each A. The fraction of active variables is reported. Figure 1 illustrates that the 
GAP Safe rule screens out many more variables than the compared method, and shows the converging nature
of our safe regions. Indeed, the more iterations are performed, the more variables the rule screens out. On
[Figure 1 plots. Horizontal axis: $-\log_{10}(\lambda/\lambda_{\max})$; legend entry: No screening.]



Figure 1: Experiments on MEG/EEG brain imaging dataset (dense data with n = 360, p = 22,494 and
q = 20). On the left: fraction of active variables as a function of $\lambda$ and the number of iterations K. The
GAP Safe strategy has a much longer range of $\lambda$ with (red) small active sets. On the right: Computation
time to reach convergence using different screening strategies.


the right, computation time confirms the effective speed-up. Our rule significantly improves the computation 
time for all duality gap tolerance from 10“^ to 10“^, especially when accurate estimates are required, e.g., for 
feature selection. 
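A logarithmic grid of regularization parameters such as the one used in these experiments can be generated as sketched below. The helper name `lambda_grid` and the parameter `delta`, which sets how many decades below $\lambda_{\max}$ the grid extends, are hypothetical; the default `delta=3.0` is an assumed value chosen for illustration.

```python
import numpy as np

def lambda_grid(lam_max, n_lambdas=100, delta=3.0):
    """Decreasing logarithmic grid from lam_max down to lam_max / 10**delta."""
    exponents = -delta * np.arange(n_lambdas) / (n_lambdas - 1)
    return lam_max * 10.0 ** exponents

grid = lambda_grid(lam_max=2.5, n_lambdas=100, delta=3.0)
```

Solving the problems in this decreasing order lets each solution serve as a starting point for the next, smaller value of $\lambda$.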


4.2 $\ell_1$ binary logistic regression

Results on the Leukemia dataset are reported in Figure 2. We compare the dynamic strategy of GAP Safe
to a sequential and non dynamic rule such as Slores [22]. We do not compare to the actual Slores rule as it
requires the previous dual optimal solution, which is not available. Slores is indeed not a safe method (see
Section B in the supplementary material). Nevertheless one can observe that dynamic strategies outperform
pure sequential ones (see Section C in the supplementary material).



[Figure 2 plots. Horizontal axis: $-\log_{10}(\lambda/\lambda_{\max})$; legend entries: No screening, GAP Safe (sequential), GAP Safe (dynamic).]



Figure 2: $\ell_1$ regularized binary logistic regression on the Leukemia dataset (n = 72; p = 7,129; q = 1).
Simple sequential and full dynamic screening GAP Safe rules are compared. On the left: fraction of the
variables that are active. Each line corresponds to a fixed number of iterations for which the algorithm is
run. On the right: computation times needed to solve the logistic regression path to desired accuracy with
100 values of $\lambda$.
4.3 $\ell_1/\ell_2$ multinomial logistic regression

We also applied GAP Safe to an $\ell_1/\ell_2$ multinomial logistic regression problem on a sparse dataset. Data are
bag of words features extracted from the News20 dataset (TF-IDF removing English stop words and words
occurring only once or more than 95% of the time). One can observe in Figure 3 the dynamic screening and
its benefit as more iterations are performed. GAP Safe leads to a significant speedup: to get a duality gap 
smaller than 10“^ on the 100 values of A, we needed 1,353 s without screening and only 485 s when GAP 
Safe was activated. 



[Figure 3 plot. Horizontal axis: $-\log_{10}(\lambda/\lambda_{\max})$; color scale: fraction of active variables, from 0.0 to 1.0.]


Figure 3: Fraction of the variables that are active for
$\ell_1/\ell_2$ regularized multinomial logistic regression on
3 classes of the News20 dataset (sparse data with n
= 2,757; p = 13,010; q = 3). Computation was run
on the best 10% of the features using univariate
feature selection [16]. Each line corresponds to a
fixed number of iterations for which the algorithm
is run.


5 Conclusion 


This contribution detailed new safe rules for accelerating algorithms solving generalized linear models
regularized with $\ell_1$ and $\ell_1/\ell_2$ norms. The rules proposed are safe, easy to implement, dynamic and converging,
making it possible to discard significantly more variables than alternative safe rules. The positive impact in terms
of computation time was observed on all tested datasets and demonstrated here on a high dimensional
regression task using brain imaging data as well as binary and multiclass classification problems on dense and
sparse data. Extensions to other generalized linear models, e.g., Poisson regression, are expected to reach the
same conclusion. Future work could investigate the optimal screening frequency, determining when the screening
has correctly detected the support.
Supplementary Material
A Proofs

A.1 Proof of variable identification

Proposition 2. There exists $k_0 \in \mathbb{N}$ such that for all $k \ge k_0$, an index $j \in [p]$ is screened out by the GAP
Safe rule if and only if $j \notin \mathcal{E}_\lambda := \{j \in [p] : \|x^{(j)\top} \hat{\Theta}^{(\lambda)}\|_2 = 1\}$.

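Proof. For simplicity we use the notation $\mathcal{R}_k = \mathcal{B}(\Theta_k, \hat{r}_\lambda(B_k, \Theta_k))$ for the safe region at step $k$. Define
$\max_{j \notin \mathcal{E}_\lambda} \|x^{(j)\top} \hat{\Theta}^{(\lambda)}\|_2 = t < 1$. Fix $\epsilon > 0$ such that $\epsilon < (1 - t) / (\max_{j \notin \mathcal{E}_\lambda} \|x^{(j)}\|_2)$. As $\Theta_k$ is converging to $\hat{\Theta}^{(\lambda)}$,
and $\lim_{k \to \infty} \hat{r}_\lambda(B_k, \Theta_k) = 0$, there exists $k_0 \in \mathbb{N}$ such that $\forall k \ge k_0, \forall \Theta \in \mathcal{R}_k$, $\|\Theta - \hat{\Theta}^{(\lambda)}\|_2 \le \epsilon$. Hence, for
any $j \notin \mathcal{E}_\lambda$ and any $\Theta \in \mathcal{R}_k$, $\|x^{(j)\top} (\Theta - \hat{\Theta}^{(\lambda)})\|_2 \le (\max_{j \notin \mathcal{E}_\lambda} \|x^{(j)}\|_2) \|\Theta - \hat{\Theta}^{(\lambda)}\|_2 \le (\max_{j \notin \mathcal{E}_\lambda} \|x^{(j)}\|_2) \epsilon$. Using the
triangle inequality, one gets

$$\|x^{(j)\top} \Theta\|_2 \le \Big(\max_{j \notin \mathcal{E}_\lambda} \|x^{(j)}\|_2\Big) \epsilon + \max_{j \notin \mathcal{E}_\lambda} \|x^{(j)\top} \hat{\Theta}^{(\lambda)}\|_2
\le \Big(\max_{j \notin \mathcal{E}_\lambda} \|x^{(j)}\|_2\Big) \epsilon + t < 1,$$

provided that $\epsilon < (1 - t) / (\max_{j \notin \mathcal{E}_\lambda} \|x^{(j)}\|_2)$. Hence, for all $k \ge k_0$, $j \notin \mathcal{E}_\lambda$ implies that $j$ is screened out by the
GAP Safe rule thanks to the last inequality. For the reverse inclusion take $j \in \mathcal{E}_\lambda$, i.e., $\|x^{(j)\top} \hat{\Theta}^{(\lambda)}\|_2 = 1$. Since
by construction of our GAP Safe screening rule $\forall k \in \mathbb{N}$, $\hat{\Theta}^{(\lambda)} \in \mathcal{R}_k$, then $j \in \{j' \in [p] : \max_{\Theta \in \mathcal{R}_k} \|x^{(j')\top} \Theta\|_2 \ge 1\}$.
This means that the variable $j$ cannot be eliminated by our safe rule, and we have shown that in the
limit we have exactly identified the equicorrelation set. □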

A.2 Proof that the GAP Safe rule is converging (Proposition 1) 

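Proof. We consider two cases.

First let us assume that $\Theta_k = R_k / \Omega^*(X^\top G(X B_k))$. Then

$$\begin{aligned}
\|\Theta_k - \hat{\Theta}^{(\lambda)}\|_2
&= \left\| \frac{G(X \hat{B}^{(\lambda)})}{\lambda} - \frac{G(X B_k)}{\Omega^*(X^\top G(X B_k))} \right\|_2 \\
&\le \left\| \frac{G(X B_k)}{\lambda} - \frac{G(X B_k)}{\Omega^*(X^\top G(X B_k))} \right\|_2
   + \left\| \frac{G(X \hat{B}^{(\lambda)}) - G(X B_k)}{\lambda} \right\|_2 \\
&\le \left| \frac{1}{\lambda} - \frac{1}{\Omega^*(X^\top G(X B_k))} \right| \|G(X B_k)\|_2
   + \frac{\|G(X \hat{B}^{(\lambda)}) - G(X B_k)\|_2}{\lambda}.
\end{aligned}$$

The second term converges to zero whenever $B_k \to \hat{B}^{(\lambda)}$ since $G$ is continuous (it is $1/\gamma$-Lipschitz). For the
first term, note that $\Omega^*(X^\top G(X B_k)) \to \Omega^*(X^\top G(X \hat{B}^{(\lambda)})) = \lambda \Omega^*(X^\top \hat{\Theta}^{(\lambda)}) = \lambda$ (thanks to the primal/dual
link, and that $\hat{\Theta}^{(\lambda)}$ is dual feasible). Then, as $G$ is a Lipschitz function and all norms are equivalent in a
finite dimensional space, the right hand side converges to zero in the previous inequality, and the stated result
follows.

In the second case $\Theta_k = R_k / \lambda$, so

$$\|\Theta_k - \hat{\Theta}^{(\lambda)}\|_2 = \frac{\|-G(X B_k) + G(X \hat{B}^{(\lambda)})\|_2}{\lambda},$$

and the proof proceeds as in the first case.

□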
B EDPP is not safe


In the last two sections, we present a study on the EDPP method [21], a screening rule that relies on the
dual optimal point obtained for the previous $\lambda$ in the path. Note that the same conclusion would hold
true for the generalization of the sequential approach given in [22], as well as for any other screening rule that
needs the exact dual solution at one step. To simplify the reading we use the vectorial (with no capital letters)
notation used earlier. In the remainder we consider $\lambda_0 = \lambda_{\max}$ and a non-increasing sequence of $T - 1$ tuning
parameters $(\lambda_t)_{t \in [T-1]}$ in $(0, \lambda_{\max})$. In practice, we choose the common grid ([7], 2.12.1): $\lambda_t =$

Wang et al. [21] proposed a sequential screening rule based on properties of the projection onto a convex set.
Their rule is based on the exact knowledge of the true optimal solution for the previous parameter. Such
a rule can be used to compute $\hat{\theta}^{(\lambda_1)}$ since $\hat{\theta}^{(\lambda_0)} = y / \lambda_0$ ($= y / \lambda_{\max}$) is known. However for $t > 1$, $\hat{\theta}^{(\lambda_{t-1})}$ is
only known approximately and the rules introduced in [21] are not safe anymore: some active groups may
be wrongly disregarded if one does not use the exact value of $\hat{\theta}^{(\lambda_{t-1})}$.

We first recall the property they proved. Then, we give a counter-example that shows that the rule
is indeed not safe. In Section C, we propose to modify their rule in order to make it safe in all cases.

Recall that in this case $q = 1$, the parameters are vectors: $B = \beta \in \mathbb{R}^p$ and $\Theta = \theta \in \mathbb{R}^n$.


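Proposition 3 ([21, Theorem 19]). Assume that $\lambda_{t-1} < \lambda_{\max}$; then the dual optimal solution $\hat{\theta}^{(\lambda_t)}$ of the group-Lasso
with parameter $\lambda_t$ satisfies

$$\hat{\theta}^{(\lambda_t)} \in \mathcal{B}\left( \hat{\theta}^{(\lambda_{t-1})} + \tfrac{1}{2} v^\perp(\lambda_{t-1}, \lambda_t), \; \tfrac{1}{2} \|v^\perp(\lambda_{t-1}, \lambda_t)\|_2 \right), \tag{15}$$

where

$$v^\perp(\lambda_{t-1}, \lambda_t) = \frac{y}{\lambda_t} - \hat{\theta}^{(\lambda_{t-1})} - \alpha[\hat{\theta}^{(\lambda_{t-1})}] \left( \frac{y}{\lambda_{t-1}} - \hat{\theta}^{(\lambda_{t-1})} \right)$$

and

$$\alpha[\hat{\theta}^{(\lambda_{t-1})}] = \arg\min_{\alpha \in \mathbb{R}_+} \left\| \frac{y}{\lambda_t} - \hat{\theta}^{(\lambda_{t-1})} - \alpha \left( \frac{y}{\lambda_{t-1}} - \hat{\theta}^{(\lambda_{t-1})} \right) \right\|_2. \tag{16}$$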

Note that the rule proposed by [21] (as pointed out in [6]) relies on the exact knowledge of a dual optimal 
solution for a previously solved Lasso problem. This is impossible to obtain in practice and even if it is 
possible to find accurate solutions, the search for high accuracy may hinder the benefits of the screening 
when it was not actually needed. Using inaccurate solutions may lead to discarding variables that should 
have been active and so the screened optimization algorithm will not converge to a solution of the original 
problem. 

We illustrate this issue in Figure 4. Knowing an approximation $\beta$ of the optimal primal point, returned
by the optimization algorithm at the previous regularization parameter $\lambda_{t-1}$, we need to choose an
approximation $\theta$ of the optimal dual point to run EDPP.

• If we choose to approximate the dual optimal point by $\theta = (y - X\beta) / \lambda_{t-1}$ (blue curve with diamonds),
then the result is catastrophic. Indeed, at $\lambda_1$, $\beta = 0$ is a valid $\epsilon$-solution for $\epsilon = 10^{-1.5}$ and the
screening rule tries to perform a division by 0 when computing $\alpha[\theta]$.

• If we choose to approximate the dual optimal point by $\theta = (y - X\beta) / \max(\lambda_{t-1}, \|X^\top (y - X\beta)\|_\infty)$, we have a better
behavior (purple curve with triangles) but we may still have an algorithm which does not converge to
an $\epsilon$-solution. Here, for the 13th Lasso problem a variable is erroneously removed and the problem can
only be solved to accuracy $0.03515 > 10^{-1.5} \approx 0.03162$. This may look like a small issue but when the
stopping criterion is based on the duality gap, this causes the algorithm to continue until the maximum
number of iterations is reached.
$$X = \begin{pmatrix} 1/\sqrt{2} & \sqrt{2}/\sqrt{3} \\ 0 & -1/\sqrt{6} \\ -1/\sqrt{2} & -1/\sqrt{6} \end{pmatrix}, \qquad
y = \begin{pmatrix} 1/\sqrt{6} \\ 1/\sqrt{6} \\ -\sqrt{2}/\sqrt{3} \end{pmatrix}$$

Figure 4: EDPP is not safe. We run GAP SAFE and two interpretations of EDPP (described in the main
text) to solve the Lasso path on the dataset defined by $X$ and $y$ above with target accuracy $10^{-1.5}$. For each
Lasso problem, we plot the final duality gap returned by the optimization solver.


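C Making the EDPP screening rule safe


C.1 The simpler screening rule

In the present paper, we give computable guarantees on the distance between the current dual feasible point
and the solution of the problem. We show here how we can combine our result with that of Wang et al. in order
to make their screening rule work even with approximate solutions to the previous Lasso problem.

For simplicity, we first consider the initial version of Wang et al.'s sphere test:

$$\hat{\theta}^{(\lambda_t)} \in \mathcal{B}\left( \hat{\theta}^{(\lambda_{t-1})}, \; \min_{\alpha \in \mathbb{R}_+} \left\| \frac{y}{\lambda_t} - \hat{\theta}^{(\lambda_{t-1})} - \alpha \left( \frac{y}{\lambda_{t-1}} - \hat{\theta}^{(\lambda_{t-1})} \right) \right\|_2 \right) \tag{17}$$

proved in [21, Theorem 7]. As we do not know $\hat{\theta}^{(\lambda_{t-1})}$, we cannot readily use this ball. However, we can
modify it to make it a safe screening rule as follows: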

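Proposition 4. Assume that $\lambda_{t-1} < \lambda_{\max}$, $\theta \in \Delta_X$ is a dual feasible point and $r_{\lambda_{t-1}} > 0$ is a radius
satisfying $\hat{\theta}^{(\lambda_{t-1})} \in \mathcal{B}(\theta, r_{\lambda_{t-1}})$, then

$$\hat{\theta}^{(\lambda_t)} \in \mathcal{B}\left( \theta, \; r_{\lambda_{t-1}} \left(1 + |1 - \alpha[\theta]|\right) + \left\| \frac{y}{\lambda_t} - \theta - \alpha[\theta] \left( \frac{y}{\lambda_{t-1}} - \theta \right) \right\|_2 \right), \tag{18}$$

where

$$\alpha[\theta] := \arg\min_{\alpha \in \mathbb{R}_+} \left\| \frac{y}{\lambda_t} - \theta - \alpha \left( \frac{y}{\lambda_{t-1}} - \theta \right) \right\|_2
= \left( \frac{\left\langle \frac{y}{\lambda_t} - \theta, \frac{y}{\lambda_{t-1}} - \theta \right\rangle}{\left\| \frac{y}{\lambda_{t-1}} - \theta \right\|_2^2} \right)_+ \tag{19}$$

and for any $t \in \mathbb{R}$, $(t)_+ = \max(0, t)$.

Proof. Start first by noting that (17) implies

$$\hat{\theta}^{(\lambda_t)} \in \bigcup_{\theta' \in \mathcal{B}(\theta, r_{\lambda_{t-1}})} \mathcal{B}\left( \theta', \; \min_{\alpha \in \mathbb{R}_+} \left\| \frac{y}{\lambda_t} - \theta' - \alpha \left( \frac{y}{\lambda_{t-1}} - \theta' \right) \right\|_2 \right).$$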

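Let us denote

$$H = \max_{\theta' \in \mathcal{B}(\theta, r_{\lambda_{t-1}})} \min_{\alpha \in \mathbb{R}_+} \left\| \frac{y}{\lambda_t} - \theta' - \alpha \left( \frac{y}{\lambda_{t-1}} - \theta' \right) \right\|_2,$$

then $\hat{\theta}^{(\lambda_t)} \in \mathcal{B}(\theta, r_{\lambda_{t-1}} + H)$. We now need to upper bound $H$. A simple choice is to take $\alpha$ to be $\alpha[\theta]$
defined in Eq. (19). The motivation for such a choice is that it is optimal when $r_{\lambda_{t-1}} = 0$. This provides
the following bound on $H$:

$$\begin{aligned}
H &\le \max_{\theta' \in \mathcal{B}(\theta, r_{\lambda_{t-1}})} \left\| \frac{y}{\lambda_t} - \theta - \alpha[\theta] \left( \frac{y}{\lambda_{t-1}} - \theta \right) + (\alpha[\theta] - 1)(\theta' - \theta) \right\|_2 \\
&\le r_{\lambda_{t-1}} |\alpha[\theta] - 1| + \left\| \frac{y}{\lambda_t} - \theta - \alpha[\theta] \left( \frac{y}{\lambda_{t-1}} - \theta \right) \right\|_2.
\end{aligned} \tag{20}$$

Hence, after some simplifications:

$$\hat{\theta}^{(\lambda_t)} \in \mathcal{B}\left( \theta, \; r_{\lambda_{t-1}} \left(1 + |1 - \alpha[\theta]|\right) + \left\| \frac{y}{\lambda_t} - \theta - \alpha[\theta] \left( \frac{y}{\lambda_{t-1}} - \theta \right) \right\|_2 \right).$$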
□ 
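The quantities in Proposition 4 are cheap to evaluate. The sketch below computes $\alpha[\theta]$ through the closed form of (19) and then the enlarged radius of (18); the function names are hypothetical, and the explicit check for a zero denominator is an addition made here, motivated by the division by zero discussed in Section B.

```python
import numpy as np

def alpha_of(theta, y, lam_prev, lam):
    """Minimizer over alpha >= 0 of ||y/lam - theta - alpha (y/lam_prev - theta)||_2."""
    a = y / lam_prev - theta
    b = y / lam - theta
    denom = a @ a
    if denom == 0.0:
        raise ZeroDivisionError("y / lam_prev - theta is zero: alpha is undefined")
    return max(0.0, (b @ a) / denom)

def safe_sequential_radius(theta, r_prev, y, lam_prev, lam):
    """Radius of the ball centered at theta given by Eq. (18)."""
    al = alpha_of(theta, y, lam_prev, lam)
    v = y / lam - theta - al * (y / lam_prev - theta)
    return r_prev * (1.0 + abs(1.0 - al)) + np.linalg.norm(v)

rng = np.random.default_rng(0)
y = rng.standard_normal(10)
theta = 0.1 * rng.standard_normal(10)
radius = safe_sequential_radius(theta, r_prev=0.05, y=y, lam_prev=1.0, lam=0.8)
```

The returned radius grows linearly with `r_prev`, so the accuracy reached at the previous regularization parameter directly controls how much screening is possible at the next one.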


Remark 8. In the case that $\|y / \lambda_{t-1}\|_2 / \|y / \lambda_{t-1} - \theta\|_2 \le 1$, then with the definition of $\alpha[\theta]$ and the
Cauchy-Schwarz inequality one has that $1 + |\alpha[\theta] - 1| \le \lambda_{t-1} / \lambda_t$. This means that the multiplicative ratio in front of
$r_{\lambda_{t-1}}$ is $\lambda_{t-1} / \lambda_t$. In [11, Proposition 3], the bound obtained would only lead to the smaller ratio: $\sqrt{\lambda_{t-1} / \lambda_t}$.

Remark 9. From the proof of Theorem 7 in [21], it holds that for A < Amax then 



< M ^ §(A) g B 

X 



( 21 ) 


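C.2 The complete screening rule (EDPP+)

Let us now consider the EDPP+ screening rule [21] relying on the property (15): $\hat{\theta}^{(\lambda_t)} \in \mathcal{B}\left( \hat{\theta}^{(\lambda_{t-1})} +
\tfrac{1}{2} v^\perp(\lambda_{t-1}, \lambda_t), \tfrac{1}{2} \|v^\perp(\lambda_{t-1}, \lambda_t)\|_2 \right)$. Using the same technique as for Proposition 4, we can strengthen our
previous proposition with the following result.

Proposition 5. Assume that $\lambda_{t-1} < \lambda_{\max}$, $\theta \in \Delta_X$ is a dual feasible point and $r_{\lambda_{t-1}} > 0$ is a radius
satisfying $\hat{\theta}^{(\lambda_{t-1})} \in \mathcal{B}(\theta, r_{\lambda_{t-1}})$. Define $\alpha[\theta]$ as in (19),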


_ |l-o[^]| + l + o[^] 1 

At 2 ^*-1 2 


f-e-a[e]i^-e) 

M M-i 


X _ y 

At At-i 


. ^At-i 


2|l* 


At-i 


-9 


2 rA,_i) 


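and

$$v^\perp(\theta, \lambda_{t-1}, \lambda_t) = \frac{y}{\lambda_t} - \theta - \alpha[\theta] \left( \frac{y}{\lambda_{t-1}} - \theta \right).$$

Then $\hat{\theta}^{(\lambda_t)} \in \mathcal{B}\left( \theta + \tfrac{1}{2} v^\perp(\theta, \lambda_{t-1}, \lambda_t), \; r_{\lambda_t} \right)$.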

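Proof. As before, we do not know $\hat{\theta}^{(\lambda_{t-1})}$ exactly but we know that, denoting

$$v^\perp(\theta', \lambda_{t-1}, \lambda_t) = \frac{y}{\lambda_t} - \theta' - \alpha[\theta'] \left( \frac{y}{\lambda_{t-1}} - \theta' \right)$$

with

$$\alpha[\theta'] = \left( \frac{\left\langle \frac{y}{\lambda_t} - \theta', \frac{y}{\lambda_{t-1}} - \theta' \right\rangle}{\left\| \frac{y}{\lambda_{t-1}} - \theta' \right\|_2^2} \right)_+,$$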

( 22 ) 


(23) 


( 24 ) 
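we have

$$\hat{\theta}^{(\lambda_t)} \in \bigcup_{\theta' \in \mathcal{B}(\theta, r_{\lambda_{t-1}})} \mathcal{B}\left( \theta' + \tfrac{1}{2} v^\perp(\theta', \lambda_{t-1}, \lambda_t), \; \tfrac{1}{2} \|v^\perp(\theta', \lambda_{t-1}, \lambda_t)\|_2 \right).$$

Our goal is to find a ball centered at $\theta + \tfrac{1}{2} v^\perp(\theta, \lambda_{t-1}, \lambda_t)$ that contains all these balls, thus containing $\hat{\theta}^{(\lambda_t)}$.

First, recalling (20),

$$\begin{aligned}
\|v^\perp(\theta', \lambda_{t-1}, \lambda_t)\|_2
&= \min_{\alpha \in \mathbb{R}_+} \left\| \frac{y}{\lambda_t} - \theta' - \alpha \left( \frac{y}{\lambda_{t-1}} - \theta' \right) \right\|_2 \\
&\le \max_{\theta' \in \mathcal{B}(\theta, r_{\lambda_{t-1}})} \min_{\alpha \in \mathbb{R}_+} \left\| \frac{y}{\lambda_t} - \theta' - \alpha \left( \frac{y}{\lambda_{t-1}} - \theta' \right) \right\|_2 \\
&\le r_{\lambda_{t-1}} |1 - \alpha[\theta]| + \left\| \frac{y}{\lambda_t} - \theta - \alpha[\theta] \left( \frac{y}{\lambda_{t-1}} - \theta \right) \right\|_2.
\end{aligned}$$

We continue as

$$\begin{aligned}
\theta' + \tfrac{1}{2} v^\perp(\theta', \lambda_{t-1}, \lambda_t) - \theta - \tfrac{1}{2} v^\perp(\theta, \lambda_{t-1}, \lambda_t)
&= (\theta' - \theta) + \tfrac{1}{2} \left( \frac{y}{\lambda_t} - \theta' - \alpha[\theta'] \left( \frac{y}{\lambda_{t-1}} - \theta' \right) - \frac{y}{\lambda_t} + \theta + \alpha[\theta] \left( \frac{y}{\lambda_{t-1}} - \theta \right) \right) \\
&= \tfrac{1}{2} \left( \theta' - \theta - (\alpha[\theta'] - \alpha[\theta]) \left( \frac{y}{\lambda_{t-1}} - \theta' \right) + \alpha[\theta] (\theta' - \theta) \right).
\end{aligned}$$

Taking the norm on both sides of the previous display,

$$\left\| \theta' + \tfrac{1}{2} v^\perp(\theta', \lambda_{t-1}, \lambda_t) - \theta - \tfrac{1}{2} v^\perp(\theta, \lambda_{t-1}, \lambda_t) \right\|_2
\le \frac{1 + \alpha[\theta]}{2} \|\theta' - \theta\|_2 + \frac{|\alpha[\theta'] - \alpha[\theta]|}{2} \left\| \frac{y}{\lambda_{t-1}} - \theta' \right\|_2.$$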


Now, recalling that $x \mapsto (x)_+$ is a 1-Lipschitz function,


(y,\0'^ — (a[^] 


< 




I 

I At_i 


/ y _ f)' y. _ y_\ / y — n M _ y \ 

1 ^ ^t—1^ -t ^ ^t — 1 ^ ^t—1'^ -i 

~l~ J- Ti ^ I I o -L 


j!^-o'\\1 


j!^-o\\I 


<ll)^ - owiixt. - ^0 - llx^ - o'WUxti - - A^> 


llA^-^'llillA^-^lli 


< 


< 


< 



_ y 

At At-i 

to 

II* 

-^'111* 

_ y 

At At-i 

r-^lli 

2 

l*-^'ll2 

M _ y_ 

At At-i 

l*-^lli 

Jl^-^'ll2 

l*-^'ll2 

l*-^lli 


\^-o\\l-\\^- 0 '\\l 


^t-1 


y o' + 0, 


2\\0-0'\\2 + \\o-o'\\J^^-o'h) 


(2||^ - 0h + 11^ - O'W^ + llx^ - Oh + ||0 - O'W^). (25) 


where the second inequality comes from the triangle inequality and the Cauchy-Schwarz inequality, and the
third is obtained by factorizing the difference of squares. Plugging this in the former, we get:


e' + - 0 - 


M _ V- 

1 At At-: 


+2\\e-e%). 

-1 9 ^ 


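One could check that there exists $\theta' \in \mathcal{B}(\theta, r_{\lambda_{t-1}})$ satisfying $\hat{\theta}^{(\lambda_t)} \in \mathcal{B}\left( \theta' + \tfrac{1}{2} v^\perp(\theta', \lambda_{t-1}, \lambda_t), \tfrac{1}{2} \|v^\perp(\theta', \lambda_{t-1}, \lambda_t)\|_2 \right)$

and so combining the last inequality with (25)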

§(M) _0_^y±(0^Xt-uh) < -0' -^v^{e',Xt-i,Xt) 

+ 0'+-v^{0', Xt-i, Xt) — 0 —-v^{0, Xt-i, Xt) ^ 

Z z At At-I o 


y_ _ y 

At At-i 


jy _ ^\\‘2 


-1 Il2 